Group 11

Gabrielle Felicia Ariyanto - 2540134874
Natasha Hartanti Winata - 2502039176
Caroline Angelina Sunarya - 2501995093
Clarissa Octavia Tjandra - 2540120143
Agnes Calista - 2501980690

1. Introduction

Source Dataset : ‘https://www.kaggle.com/code/burhanykiyakoglu/predicting-house-prices/notebook’

House Data is a collection of data about houses sold in King County, Washington, United States, between May 2014 and May 2015. House prices will be predicted using the linear regression approach.
#Import File 
#The dataset from Kaggle is uploaded to github for access via the link
HouseData <- read.csv("https://raw.githubusercontent.com/GabrielleFeliciaA/house_price_data/main/kc_house_data.csv")

head(HouseData)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7129300520 20141013T000000  221900        3      1.00        1180     5650
## 2 6414100192 20141209T000000  538000        3      2.25        2570     7242
## 3 5631500400 20150225T000000  180000        2      1.00         770    10000
## 4 2487200875 20141209T000000  604000        4      3.00        1960     5000
## 5 1954400510 20150218T000000  510000        3      2.00        1680     8080
## 6 7237550310 20140512T000000 1225000        4      4.50        5420   101930
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1      1          0    0         3     7       1180             0     1955
## 2      2          0    0         3     7       2170           400     1951
## 3      1          0    0         3     6        770             0     1933
## 4      1          0    0         5     7       1050           910     1965
## 5      1          0    0         3     8       1680             0     1987
## 6      1          0    0         3    11       3890          1530     2001
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98178 47.5112 -122.257          1340       5650
## 2         1991   98125 47.7210 -122.319          1690       7639
## 3            0   98028 47.7379 -122.233          2720       8062
## 4            0   98136 47.5208 -122.393          1360       5000
## 5            0   98074 47.6168 -122.045          1800       7503
## 6            0   98053 47.6561 -122.005          4760     101930

2. Dataset Description

`id`              : (num) house id
`date`            : (chr) the date the house was sold
`price`           : (num) house price
`bedrooms`        : (int) the number of bedrooms in a house
`bathrooms`       : (num) the number of bathrooms in a house
`sqft_living`     : (int) living area of the house in square feet
`sqft_lot`        : (int) land area
`floors`          : (num) the number of floors in the house
`waterfront`      : (int) whether the house overlooks the waterfront (1 = yes, 0 = no)
`view`            : (int) view rating
`condition`       : (int) house condition
`grade`           : (int) overall assessment of the house
`sqft_above`      : (int) the area of the house above ground level, excluding the basement
`sqft_basement`   : (int) basement area
`yr_built`        : (int) year the house was built
`yr_renovated`    : (int) year the house was renovated
`zipcode`         : (int) zip code
`lat`             : (num) latitude coordinates
`long`            : (num) longitude coordinates
`sqft_living15`   : (int) average living area of the 15 nearest neighboring houses
`sqft_lot15`      : (int) average land area of the 15 nearest neighboring houses
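The six rows printed by head(HouseData) hint at how the three house-area variables relate: in each row, sqft_living equals sqft_above plus sqft_basement. The check below is only a small sketch that retypes those six rows by hand. Whether the same identity holds for every row of the full data set is an assumption that would have to be verified on HouseData itself.

```r
# Values copied from the head(HouseData) output above
sqft_living   <- c(1180, 2570, 770, 1960, 1680, 5420)
sqft_above    <- c(1180, 2170, 770, 1050, 1680, 3890)
sqft_basement <- c(0, 400, 0, 910, 0, 1530)

# TRUE for a row when the living area is the sum of the
# above-ground area and the basement area
area_matches <- sqft_living == sqft_above + sqft_basement
all(area_matches)
```

On the full data set the same comparison could be written with the HouseData columns in place of the hand-typed vectors.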
#Packages
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.1.3
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(mapview)
## Warning: package 'mapview' was built under R version 4.1.3
library(caret)
## Warning: package 'caret' was built under R version 4.1.3
## 
## Attaching package: 'caret'
## The following object is masked from 'package:survival':
## 
##     cluster
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded

3. Exploring Dataset

dim(HouseData)
## [1] 21613    21
The result above indicated that there are 21613 observations (rows) and 21 variables (columns) of data.
str(HouseData)
## 'data.frame':    21613 obs. of  21 variables:
##  $ id           : num  7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : chr  "20141013T000000" "20141209T000000" "20150225T000000" "20141209T000000" ...
##  $ price        : num  221900 538000 180000 604000 510000 ...
##  $ bedrooms     : int  3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num  1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : int  1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
##  $ sqft_lot     : int  5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
##  $ floors       : num  1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : int  7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : int  1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
##  $ sqft_basement: int  0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : int  1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
##  $ yr_renovated : int  0 1991 0 0 0 0 0 0 0 0 ...
##  $ zipcode      : int  98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
##  $ lat          : num  47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ sqft_living15: int  1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
##  $ sqft_lot15   : int  5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...
Three data types are used in this data set: chr, num, and int. Specifically, one (1) variable is of type chr, six (6) variables are of type num, and fourteen (14) variables are of type int. The chr variable is 'date'. The num variables are 'id', 'price', 'bathrooms', 'floors', 'lat', and 'long'. All remaining variables are of type int.
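Since 'date' is stored as text, it cannot be used directly for date arithmetic. The sketch below shows one possible way to convert strings in this format to the Date class, using three values copied from the str() output. The names date_chr and date_parsed are ours and are not part of the original analysis.

```r
# Date strings in the format used by the 'date' column
date_chr <- c("20141013T000000", "20141209T000000", "20150225T000000")

# Keep the first 8 characters (YYYYMMDD) and convert them to the Date class
date_parsed <- as.Date(substr(date_chr, 1, 8), format = "%Y%m%d")

class(date_parsed)
format(date_parsed, "%Y-%m")   # year and month of each sale
```

The time part is always T000000 in the rows shown, so dropping it loses no information for those rows.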
summary(HouseData)
##        id                date               price            bedrooms     
##  Min.   :1.000e+06   Length:21613       Min.   :  75000   Min.   : 0.000  
##  1st Qu.:2.123e+09   Class :character   1st Qu.: 321950   1st Qu.: 3.000  
##  Median :3.905e+09   Mode  :character   Median : 450000   Median : 3.000  
##  Mean   :4.580e+09                      Mean   : 540088   Mean   : 3.371  
##  3rd Qu.:7.309e+09                      3rd Qu.: 645000   3rd Qu.: 4.000  
##  Max.   :9.900e+09                      Max.   :7700000   Max.   :33.000  
##    bathrooms      sqft_living       sqft_lot           floors     
##  Min.   :0.000   Min.   :  290   Min.   :    520   Min.   :1.000  
##  1st Qu.:1.750   1st Qu.: 1427   1st Qu.:   5040   1st Qu.:1.000  
##  Median :2.250   Median : 1910   Median :   7618   Median :1.500  
##  Mean   :2.115   Mean   : 2080   Mean   :  15107   Mean   :1.494  
##  3rd Qu.:2.500   3rd Qu.: 2550   3rd Qu.:  10688   3rd Qu.:2.000  
##  Max.   :8.000   Max.   :13540   Max.   :1651359   Max.   :3.500  
##    waterfront            view          condition         grade       
##  Min.   :0.000000   Min.   :0.0000   Min.   :1.000   Min.   : 1.000  
##  1st Qu.:0.000000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.: 7.000  
##  Median :0.000000   Median :0.0000   Median :3.000   Median : 7.000  
##  Mean   :0.007542   Mean   :0.2343   Mean   :3.409   Mean   : 7.657  
##  3rd Qu.:0.000000   3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.: 8.000  
##  Max.   :1.000000   Max.   :4.0000   Max.   :5.000   Max.   :13.000  
##    sqft_above   sqft_basement       yr_built     yr_renovated   
##  Min.   : 290   Min.   :   0.0   Min.   :1900   Min.   :   0.0  
##  1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951   1st Qu.:   0.0  
##  Median :1560   Median :   0.0   Median :1975   Median :   0.0  
##  Mean   :1788   Mean   : 291.5   Mean   :1971   Mean   :  84.4  
##  3rd Qu.:2210   3rd Qu.: 560.0   3rd Qu.:1997   3rd Qu.:   0.0  
##  Max.   :9410   Max.   :4820.0   Max.   :2015   Max.   :2015.0  
##     zipcode           lat             long        sqft_living15 
##  Min.   :98001   Min.   :47.16   Min.   :-122.5   Min.   : 399  
##  1st Qu.:98033   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.:1490  
##  Median :98065   Median :47.57   Median :-122.2   Median :1840  
##  Mean   :98078   Mean   :47.56   Mean   :-122.2   Mean   :1987  
##  3rd Qu.:98118   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.:2360  
##  Max.   :98199   Max.   :47.78   Max.   :-121.3   Max.   :6210  
##    sqft_lot15    
##  Min.   :   651  
##  1st Qu.:  5100  
##  Median :  7620  
##  Mean   : 12768  
##  3rd Qu.: 10083  
##  Max.   :871200
The insights gained from the result above are:
- The maximum price of a house is 7700000 and the minimum price of a house is 75000.
- The oldest houses were built in 1900 and the newest houses were built in 2015.
- In the 'yr_renovated' variable, the minimum, 1st quartile, median, and 3rd quartile are all 0, and the maximum is 2015. Since 0 is not a valid year, it most likely marks houses with no recorded renovation. This indicates that at least 75 percent of the houses have never been renovated, and that the most recent renovation took place in 2015.
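A value of 0 in 'yr_renovated' cannot be a real year, so it is useful to separate the zeros from the actual renovation years before summarizing. The sketch below uses the ten 'yr_renovated' values printed by str(HouseData). Treating 0 as "never renovated" is our assumption, and the names renovated and yr_clean are ours.

```r
# First ten 'yr_renovated' values, copied from the str() output above
yr_renovated <- c(0, 1991, 0, 0, 0, 0, 0, 0, 0, 0)

# Assumption: 0 means that no renovation was recorded
renovated <- yr_renovated > 0

mean(renovated)                  # share of houses with a recorded renovation
range(yr_renovated[renovated])   # earliest and latest renovation year

# Replace the zeros with NA so that they do not distort means or quartiles
yr_clean <- ifelse(renovated, yr_renovated, NA)
summary(yr_clean)
```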
describe(HouseData)
## HouseData 
## 
##  21  Variables      21613  Observations
## --------------------------------------------------------------------------------
## id 
##         n   missing  distinct      Info      Mean       Gmd       .05       .10 
##     21613         0     21436         1  4.58e+09 3.296e+09 5.125e+08 1.036e+09 
##       .25       .50       .75       .90       .95 
## 2.123e+09 3.905e+09 7.309e+09 8.732e+09 9.297e+09 
## 
## lowest :    1000102    1200019    1200021    2800031    3600057
## highest: 9842300095 9842300485 9842300540 9895000040 9900000190
## --------------------------------------------------------------------------------
## date 
##        n  missing distinct 
##    21613        0      372 
## 
## lowest : 20140502T000000 20140503T000000 20140504T000000 20140505T000000 20140506T000000
## highest: 20150513T000000 20150514T000000 20150515T000000 20150524T000000 20150527T000000
## --------------------------------------------------------------------------------
## price 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0     4028        1   540088   329387   210000   245000 
##      .25      .50      .75      .90      .95 
##   321950   450000   645000   887000  1156480 
## 
## lowest :   75000   78000   80000   81000   82000
## highest: 5350000 5570000 6885000 7062500 7700000
## --------------------------------------------------------------------------------
## bedrooms 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0       13    0.871    3.371    0.946        2        2 
##      .25      .50      .75      .90      .95 
##        3        3        4        4        5 
## 
## lowest :  0  1  2  3  4, highest:  8  9 10 11 33
##                                                                             
## Value          0     1     2     3     4     5     6     7     8     9    10
## Frequency     13   199  2760  9824  6882  1601   272    38    13     6     3
## Proportion 0.001 0.009 0.128 0.455 0.318 0.074 0.013 0.002 0.001 0.000 0.000
##                       
## Value         11    33
## Frequency      1     1
## Proportion 0.000 0.000
## --------------------------------------------------------------------------------
## bathrooms 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0       30    0.974    2.115   0.8444     1.00     1.00 
##      .25      .50      .75      .90      .95 
##     1.75     2.25     2.50     3.00     3.50 
## 
## lowest : 0.00 0.50 0.75 1.00 1.25, highest: 6.50 6.75 7.50 7.75 8.00
## --------------------------------------------------------------------------------
## sqft_living 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0     1038        1     2080    978.4      940     1090 
##      .25      .50      .75      .90      .95 
##     1427     1910     2550     3250     3760 
## 
## lowest :   290   370   380   384   390, highest:  9640  9890 10040 12050 13540
## --------------------------------------------------------------------------------
## sqft_lot 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0     9782        1    15107    17855     1800     3322 
##      .25      .50      .75      .90      .95 
##     5040     7618    10688    21398    43339 
## 
## lowest :     520     572     600     609     635
## highest:  982998 1024068 1074218 1164794 1651359
## --------------------------------------------------------------------------------
## floors 
##        n  missing distinct     Info     Mean      Gmd 
##    21613        0        6    0.823    1.494   0.5563 
## 
## lowest : 1.0 1.5 2.0 2.5 3.0, highest: 1.5 2.0 2.5 3.0 3.5
##                                               
## Value        1.0   1.5   2.0   2.5   3.0   3.5
## Frequency  10680  1910  8241   161   613     8
## Proportion 0.494 0.088 0.381 0.007 0.028 0.000
## --------------------------------------------------------------------------------
## waterfront 
##        n  missing distinct     Info      Sum     Mean      Gmd 
##    21613        0        2    0.022      163 0.007542  0.01497 
## 
## --------------------------------------------------------------------------------
## view 
##        n  missing distinct     Info     Mean      Gmd 
##    21613        0        5    0.267   0.2343   0.4322 
## 
## lowest : 0 1 2 3 4, highest: 0 1 2 3 4
##                                         
## Value          0     1     2     3     4
## Frequency  19489   332   963   510   319
## Proportion 0.902 0.015 0.045 0.024 0.015
## --------------------------------------------------------------------------------
## condition 
##        n  missing distinct     Info     Mean      Gmd 
##    21613        0        5    0.708    3.409   0.6161 
## 
## lowest : 1 2 3 4 5, highest: 1 2 3 4 5
##                                         
## Value          1     2     3     4     5
## Frequency     30   172 14031  5679  1701
## Proportion 0.001 0.008 0.649 0.263 0.079
## --------------------------------------------------------------------------------
## grade 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0       12    0.903    7.657    1.231        6        6 
##      .25      .50      .75      .90      .95 
##        7        7        8        9       10 
## 
## lowest :  1  3  4  5  6, highest:  9 10 11 12 13
##                                                                             
## Value          1     3     4     5     6     7     8     9    10    11    12
## Frequency      1     3    29   242  2038  8981  6068  2615  1134   399    90
## Proportion 0.000 0.000 0.001 0.011 0.094 0.416 0.281 0.121 0.052 0.018 0.004
##                 
## Value         13
## Frequency     13
## Proportion 0.001
## --------------------------------------------------------------------------------
## sqft_above 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0      946        1     1788    876.2      850      970 
##      .25      .50      .75      .90      .95 
##     1190     1560     2210     2950     3400 
## 
## lowest :  290  370  380  384  390, highest: 7880 8020 8570 8860 9410
## --------------------------------------------------------------------------------
## sqft_basement 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0      306    0.776    291.5    422.2        0        0 
##      .25      .50      .75      .90      .95 
##        0        0      560      970     1190 
## 
## lowest :    0   10   20   40   50, highest: 3260 3480 3500 4130 4820
## --------------------------------------------------------------------------------
## yr_built 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0      116        1     1971    33.38     1915     1926 
##      .25      .50      .75      .90      .95 
##     1951     1975     1997     2007     2011 
## 
## lowest : 1900 1901 1902 1903 1904, highest: 2011 2012 2013 2014 2015
## --------------------------------------------------------------------------------
## yr_renovated 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0       70    0.122     84.4    161.7        0        0 
##      .25      .50      .75      .90      .95 
##        0        0        0        0        0 
## 
## lowest :    0 1934 1940 1944 1945, highest: 2011 2012 2013 2014 2015
##                                                                             
## Value          0  1935  1940  1945  1950  1955  1960  1965  1970  1975  1980
## Frequency  20699     1     2     6     4    13    12    16    27    25    43
## Proportion 0.958 0.000 0.000 0.000 0.000 0.001 0.001 0.001 0.001 0.001 0.002
##                                                     
## Value       1985  1990  1995  2000  2005  2010  2015
## Frequency     88    99    84   112   156    82   144
## Proportion 0.004 0.005 0.004 0.005 0.007 0.004 0.007
## 
## For the frequency table, variable is rounded to the nearest 5
## --------------------------------------------------------------------------------
## zipcode 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0       70        1    98078    60.77    98004    98008 
##      .25      .50      .75      .90      .95 
##    98033    98065    98118    98155    98177 
## 
## lowest : 98001 98002 98003 98004 98005, highest: 98177 98178 98188 98198 98199
## --------------------------------------------------------------------------------
## lat 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0     5034        1    47.56   0.1573    47.31    47.35 
##      .25      .50      .75      .90      .95 
##    47.47    47.57    47.68    47.73    47.75 
## 
## lowest : 47.1559 47.1593 47.1622 47.1647 47.1764
## highest: 47.7771 47.7772 47.7774 47.7775 47.7776
## --------------------------------------------------------------------------------
## long 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0      752        1   -122.2   0.1558   -122.4   -122.4 
##      .25      .50      .75      .90      .95 
##   -122.3   -122.2   -122.1   -122.0   -122.0 
## 
## lowest : -122.519 -122.515 -122.514 -122.512 -122.511
## highest: -121.325 -121.321 -121.319 -121.316 -121.315
## --------------------------------------------------------------------------------
## sqft_living15 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0      777        1     1987    743.2     1140     1256 
##      .25      .50      .75      .90      .95 
##     1490     1840     2360     2930     3300 
## 
## lowest :  399  460  620  670  690, highest: 5600 5610 5790 6110 6210
## --------------------------------------------------------------------------------
## sqft_lot15 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##    21613        0     8689        1    12768    13404     1999     3667 
##      .25      .50      .75      .90      .95 
##     5100     7620    10083    17852    37063 
## 
## lowest :    651    659    660    748    750, highest: 434728 438213 560617 858132 871200
## --------------------------------------------------------------------------------
The results above showed us:
- The smallest number of bedrooms in a house is 0 and the largest is 33. The most frequent number is 3, found in 9824 houses, or 45.5 percent of the data.
- Most houses have 1 or 2 floors: 10680 houses (49.4 percent) have 1 floor, and 8241 houses (38.1 percent) have 2 floors.
- Houses with a view rating of 0 dominate the data set, making up 90.2 percent of the observations.
- For the 'condition' variable, level 3 dominates: 14031 houses, or 64.9 percent of the data, are in condition 3.
- For the overall assessment ('grade'), most houses are graded 7 or 8. 8981 houses (41.6 percent) are graded 7, and 6068 houses (28.1 percent) are graded 8.
- The describe() output refines the earlier reading of 'yr_renovated'. 20699 houses, almost 96 percent of the data, have the value 0, meaning that no renovation is recorded. The remaining houses were renovated in years ranging from 1934 to 2015; the frequency table rounds these years to the nearest 5. The groups around 2005 (156 houses) and 2015 (144 houses) are the largest, each holding about 0.7 percent of the data.
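The Proportion row printed by describe() is each frequency divided by the number of observations; multiplied by 100 it gives a percentage of the data. As a small sketch, the shares for 'floors' can be recomputed from the frequencies printed above. The vector names are ours.

```r
# Frequencies of 'floors', copied from the describe() output above
floor_counts <- c("1.0" = 10680, "1.5" = 1910, "2.0" = 8241,
                  "2.5" = 161, "3.0" = 613, "3.5" = 8)

# The frequencies add up to the number of observations
sum(floor_counts)

# Percentage of houses for each number of floors
floor_percent <- round(100 * floor_counts / sum(floor_counts), 1)
floor_percent
```

The same calculation applied directly to the data would be prop.table(table(HouseData$floors)).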
colSums(is.na(HouseData))
##            id          date         price      bedrooms     bathrooms 
##             0             0             0             0             0 
##   sqft_living      sqft_lot        floors    waterfront          view 
##             0             0             0             0             0 
##     condition         grade    sqft_above sqft_basement      yr_built 
##             0             0             0             0             0 
##  yr_renovated       zipcode           lat          long sqft_living15 
##             0             0             0             0             0 
##    sqft_lot15 
##             0
The output above shows that none of the variables contains missing values.
all(duplicated(HouseData)==TRUE)
## [1] FALSE
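The syntax above returned the value FALSE. Note that duplicated() never flags the first occurrence of a row, so all(duplicated(HouseData)) returns FALSE for any non-empty data set and does not by itself prove that no row is duplicated; any(duplicated(HouseData)) is the call that tests whether at least one duplicated row exists. The describe() output also shows that some 'id' values repeat (21436 distinct ids in 21613 rows). It can be concluded that the data set is free from missing values, while the absence of fully duplicated rows should be confirmed with any().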
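The difference between all() and any() on the result of duplicated() can be seen on a small example. The data frame toy below is hypothetical and is not taken from the house data.

```r
# Hypothetical data frame in which the third row repeats the second row
toy <- data.frame(id = c(1, 2, 2, 3), price = c(100, 200, 200, 300))

# duplicated() marks a row as TRUE when an identical row appeared earlier
dup_flags <- duplicated(toy)
dup_flags

all(dup_flags)   # TRUE only if every row is flagged
any(dup_flags)   # TRUE if at least one row is flagged
sum(dup_flags)   # number of flagged rows
```

Here all() returns FALSE even though the toy data contain a duplicated row, which is why any() or sum() is the safer check.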

4. Check Anomalies

ThreeSigma <- function(x, t = 3){

 mu <- mean(x, na.rm = TRUE)
 sig <- sd(x, na.rm = TRUE)
 if (sig == 0){
  message("All non-missing x-values are identical")
 }
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
 }

Hampel <- function(x, t = 3){

 mu <- median(x, na.rm = TRUE)
 sig <- mad(x, na.rm = TRUE)
 if (sig == 0){
  message("Hampel identifier implosion: MAD scale estimate is zero")
 }
 up <- mu + t * sig
 down <- mu - t * sig
 out <- list(up = up, down = down)
 return(out)
}
   
BoxplotRule<- function(x, t = 1.5){

 xL <- quantile(x, na.rm = TRUE, probs = 0.25, names = FALSE)
 xU <- quantile(x, na.rm = TRUE, probs = 0.75, names = FALSE)
 Q <- xU - xL
 if (Q == 0){
  message("Boxplot rule implosion: interquartile distance is zero")
 }
 up <- xU + t * Q
 down <- xL - t * Q
 out <- list(up = up, down = down)
 return(out)
}   

ExtractDetails <- function(x, down, up){

 outClass <- rep("N", length(x))
 indexLo <- which(x < down)
 indexHi <- which(x > up)
 outClass[indexLo] <- "L"
 outClass[indexHi] <- "U"
 index <- union(indexLo, indexHi)
 values <- x[index]
 outClass <- outClass[index]
 nOut <- length(index)
 maxNom <- max(x[which(x <= up)])
 minNom <- min(x[which(x >= down)])
 outList <- list(nOut = nOut, lowLim = down,
 upLim = up, minNom = minNom,
 maxNom = maxNom, index = index,
 values = values,
 outClass = outClass)
 return(outList)
 }
FindOutliers <- function(x, t3 = 3, tH = 3, tb = 1.5){
 threeLims <- ThreeSigma(x, t = t3)
 HampLims <- Hampel(x, t = tH)
 boxLims <- BoxplotRule(x, t = tb)

 n <- length(x)
 nMiss <- length(which(is.na(x)))

 threeList <- ExtractDetails(x, threeLims$down, threeLims$up)
 HampList <- ExtractDetails(x, HampLims$down, HampLims$up)
 boxList <- ExtractDetails(x, boxLims$down, boxLims$up)

 sumFrame <- data.frame(method = "ThreeSigma", n = n,
 nMiss = nMiss, nOut = threeList$nOut,
 lowLim = threeList$lowLim,
 upLim = threeList$upLim,
 minNom = threeList$minNom,
 maxNom = threeList$maxNom)
 upFrame <- data.frame(method = "Hampel", n = n,
 nMiss = nMiss, nOut = HampList$nOut,
 lowLim = HampList$lowLim,
 upLim = HampList$upLim,
 minNom = HampList$minNom,
 maxNom = HampList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)
 upFrame <- data.frame(method = "BoxplotRule", n = n,
 nMiss = nMiss, nOut = boxList$nOut,
 lowLim = boxList$lowLim,
 upLim = boxList$upLim,
 minNom = boxList$minNom,
 maxNom = boxList$maxNom)
 sumFrame <- rbind.data.frame(sumFrame, upFrame)

 threeFrame <- data.frame(index = threeList$index,
 values = threeList$values,
 type = threeList$outClass)
 HampFrame <- data.frame(index = HampList$index,
 values = HampList$values,
 type = HampList$outClass)
 boxFrame <- data.frame(index = boxList$index,
 values = boxList$values,
 type = boxList$outClass)
 outList <- list(summary = sumFrame, threeSigma = threeFrame,
 Hampel = HampFrame, boxplotRule = boxFrame)
 return(outList)
}
summary_outliers <- FindOutliers(HouseData$price)
summary_outliers$summary
##        method     n nMiss nOut    lowLim   upLim minNom  maxNom
## 1  ThreeSigma 21613     0  406 -561293.4 1641470  75000 1640000
## 2      Hampel 21613     0 1166 -217170.0 1117170  75000 1115500
## 3 BoxplotRule 21613     0 1146 -162625.0 1129575  75000 1127500
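avg <- mean(HouseData$price)
std <- sd(HouseData$price)

# Count the prices that lie more than 3 standard deviations from the mean
# (abs() must wrap the difference only, not the comparison)
n_outliers_ts <- sum(abs(HouseData$price - avg) > 3 * std)

# Limits of the Three Sigma rule
upper_ts <- avg + 3 * std
lower_ts <- avg - 3 * std
outliers_ts <- list(up = upper_ts, down = lower_ts)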

plot(HouseData$price,
     main = "Three Sigma",
     ylab = "Value",
     col = 'blue', ylim= c(-1e+6,5e+06))

abline(h = mean(HouseData$price), lty = "dashed", lwd = 1)
abline(h = upper_ts, lty = "dotted", lwd = 2)
abline(h = lower_ts, lty = "dotted", lwd = 2)

med <- median(HouseData$price)
sig <- mad(HouseData$price)

data <- HouseData$price
outliers_h <- sum(abs(data - med) > 3 * sig)

upper_h <- med + 3 * sig
lower_h <- med - 3 * sig
outliers_hampel <- list(up = upper_h, down = lower_h)

plot(HouseData$price,
     main = "Hampel Identifier",
     ylab = "Value",
     col = 'blue',ylim= c(-1e+6,5e+06))

abline(h = median(HouseData$price), lty = "dashed", lwd = 1)
abline(h = upper_h, lty = "dotted", lwd = 2)
abline(h = lower_h, lty = "dotted", lwd = 2)

out <- boxplot.stats(HouseData$price)$out

boxplot(HouseData$price,
  ylab = "",
  main = "House Price Boxplot"
)

mtext(paste("Outliers:", length(out)))

Using the Three Sigma edit rule, the lower limit for non-outlier data is -561293.4 (minNom 75000) and the upper limit is 1641470 (maxNom 1640000), so 406 data points are considered outliers. Using the Hampel identifier, the lower limit is -217170.0 (minNom 75000) and the upper limit is 1117170 (maxNom 1115500), so 1166 data points are considered outliers. Lastly, using the Boxplot rule, the lower limit is -162625.0 (minNom 75000) and the upper limit is 1129575 (maxNom 1127500), so 1146 data points are considered outliers. All three lower limits are negative, so no price falls below them and every detected outlier lies above an upper limit. Among the lower limits, the one from the Boxplot rule is the most reasonable because it is the closest to the minNom value. Among the upper limits, the one from the Hampel identifier is the most reasonable: it is the smallest of the three and stays closest to the center of the distribution.
Taken together, the Three Sigma rule, the Hampel identifier, and the Boxplot rule flag 406, 1166, and 1146 data points as outliers respectively. Based on the results of the FindOutliers() function and the plots above, we consider the Hampel identifier the most reliable; it identifies 1166 data points as outliers.
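One reason the Three Sigma rule flags fewer points is that the mean and standard deviation are themselves pulled upward by extreme values, while the median, MAD, and quartiles are not. The sketch below illustrates this on a small hypothetical vector x (not house prices) with one obvious outlier, using the same multipliers as the functions above.

```r
# Hypothetical values with one obvious outlier (100)
x <- c(10, 11, 12, 13, 14, 100)

# Three Sigma rule: mean +/- 3 standard deviations
ts_up   <- mean(x) + 3 * sd(x)
ts_down <- mean(x) - 3 * sd(x)

# Hampel identifier: median +/- 3 MAD
h_up   <- median(x) + 3 * mad(x)
h_down <- median(x) - 3 * mad(x)

# Boxplot rule: quartiles -/+ 1.5 times the interquartile distance
q      <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
b_up   <- q[2] + 1.5 * (q[2] - q[1])
b_down <- q[1] - 1.5 * (q[2] - q[1])

# Number of points flagged by each rule
n_out <- c(ThreeSigma  = sum(x < ts_down | x > ts_up),
           Hampel      = sum(x < h_down | x > h_up),
           BoxplotRule = sum(x < b_down | x > b_up))
n_out
```

In this toy case the outlier inflates the standard deviation so much that the Three Sigma rule misses it, while the other two rules flag it.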
# Upper limit taken from the Hampel identifier, lower limit from the Boxplot rule
outlierIndex_table <- which(HouseData$price > 1117170 | HouseData$price < -162625.0)
slice_data <- slice(HouseData, outlierIndex_table)
head(slice_data)
##           id            date   price bedrooms bathrooms sqft_living sqft_lot
## 1 7237550310 20140512T000000 1225000        4      4.50        5420   101930
## 2 2524049179 20140826T000000 2000000        3      2.75        3050    44867
## 3  822039084 20150311T000000 1350000        3      2.50        2753    65005
## 4 1802000060 20140612T000000 1325000        5      2.25        3200    20158
## 5 4389200955 20150302T000000 1450000        4      2.75        2750    17789
## 6 7855801670 20150401T000000 2250000        4      3.25        5180    19850
##   floors waterfront view condition grade sqft_above sqft_basement yr_built
## 1    1.0          0    0         3    11       3890          1530     2001
## 2    1.0          0    4         3     9       2330           720     1968
## 3    1.0          1    2         5     9       2165           588     1953
## 4    1.0          0    0         3     8       1600          1600     1965
## 5    1.5          0    0         3     8       1980           770     1914
## 6    2.0          0    3         3    12       3540          1640     2006
##   yr_renovated zipcode     lat     long sqft_living15 sqft_lot15
## 1            0   98053 47.6561 -122.005          4760     101930
## 2            0   98040 47.5316 -122.233          4110      20336
## 3            0   98070 47.4041 -122.451          2680      72513
## 4            0   98004 47.6303 -122.215          3390      20158
## 5         1992   98004 47.6141 -122.212          3060      11275
## 6            0   98006 47.5620 -122.162          3160       9750
Outliers were found in the price variable after searching for anomalies in the dataset, but our team opted not to remove them. Prices outside the usual range can result from a variety of factors, such as rising material and labor costs, inflation, and a rising cost of living, as well as several other factors that determine the price of a house.

5. Visualizing Variable Relation and Pattern Discovery

canvas <- layout(matrix(c(1,2,3,4),nrow=2,byrow=TRUE))

plot(HouseData$sqft_living, HouseData$bathrooms,main="sqft_living VS Bathrooms", xlab="Living Area (sqft)",ylab="Bathrooms")
# The y-axis must be sqft_living15; plotting sqft_living against itself would give a perfect straight line
plot(HouseData$sqft_living, HouseData$sqft_living15,main="sqft_living VS sqft_living15",xlab="Living Area (sqft)",ylab="sqft_living15")
plot(HouseData$sqft_above, HouseData$sqft_living,main="sqft_above VS sqft_living",xlab="Area Above Ground (sqft)",ylab="Living Area (sqft)")
plot(HouseData$lat, HouseData$bedrooms,main="lat VS bedrooms",xlab="Lat",ylab="Bedrooms")

From the output above:
- 'sqft_living' and 'bathrooms' appear to be related: the higher the value of 'sqft_living', the higher the number of 'bathrooms' tends to be, and vice versa.
- 'sqft_living' and 'sqft_living15' are positively related: the higher the value of 'sqft_living', the higher the value of 'sqft_living15' tends to be, and vice versa.
- 'sqft_living' and 'sqft_above' are positively related: the higher the value of 'sqft_above', the higher the value of 'sqft_living', and vice versa.
- 'lat' and 'bedrooms' do not appear to be related, because the plot shows neither an ascending nor a descending trend.

PATTERN DISCOVERY

HousePattern <- HouseData
HousePattern$view = as.factor(HousePattern$view)
ggplot(HousePattern, aes(x=view, y=price)) + geom_boxplot() + ggtitle("House Price vs. Rating View")

According to the output above, houses with views rated 3 and 4 have higher prices than houses with views rated 0, 1, or 2.
HousePattern$bedrooms = as.factor(HousePattern$bedrooms)
ggplot(HousePattern, aes(x=bedrooms, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Bedrooms")

Looking at the 2nd quartile (the median) of the data, houses with 9 bedrooms have the highest price. However, looking at the overall distribution of the data, houses with 8 bedrooms have the highest price.
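
The boxplot reading above can be backed by numbers. The following is a minimal sketch on a made-up data frame `toy` (an assumption for illustration); with the real data the same calls would use `HousePattern$price` and `HousePattern$bedrooms`.

```r
# Hypothetical toy data (assumption), not the house dataset
toy <- data.frame(
  bedrooms = c(2, 2, 3, 3, 3, 4, 4, 4),
  price    = c(180000, 257500, 221900, 538000, 510000, 604000, 1225000, 400000)
)

# Median (2nd quartile) and maximum price within each bedroom count
median_by_bed <- tapply(toy$price, toy$bedrooms, median)
max_by_bed    <- tapply(toy$price, toy$bedrooms, max)

# Bedroom count with the highest median price
top_median <- names(which.max(median_by_bed))

print(median_by_bed)
print(max_by_bed)
```

Comparing the group medians with the group maxima separates "typically most expensive" from "contains the single most expensive house".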
HousePattern$bathrooms = as.factor(HousePattern$bathrooms)
ggplot(HousePattern, aes(x=bathrooms, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Bathrooms")

The plot shows that houses priced below 1,000,000 dollars mostly have fewer than 3 bathrooms, and only a few of them have more than 3 bathrooms, in the range of 3.25 to 3.5.
HousePattern$floors = as.factor(HousePattern$floors)
ggplot(HousePattern, aes(x=floors, y=price)) + geom_boxplot() + ggtitle("House Price vs. Number of Floors")

The result above shows that houses with 2.5 floors have the highest price when compared to other floor levels.
HousePattern$waterfront = as.factor(HousePattern$waterfront)
ggplot(HousePattern, aes(x=waterfront, y=price)) + geom_boxplot() + ggtitle("House Price vs. Waterfront")

The result above shows that houses with a waterfront command a higher price than houses without a waterfront.
HousePattern$condition = as.factor(HousePattern$condition)
ggplot(HousePattern, aes(x=condition, y=price)) + geom_boxplot() + ggtitle("House Price vs. House Condition Rating")

Houses with conditions at levels 3, 4, and 5 are more expensive than houses with conditions at levels 1 or 2. Houses with a condition at level 5 are the most expensive.
HousePattern$grade = as.factor(HousePattern$grade)
ggplot(HousePattern, aes(x=grade, y=price)) + geom_boxplot() + ggtitle("House Price vs. House Grade")

The plot above shows a positive correlation between the house price and the house grade variables. It can be concluded that the higher the grade of a house, the higher the price of the house.
HousePattern$isRenovated <- as.logical(HousePattern$yr_renovated)
ggplot(HousePattern, aes(x=isRenovated, y=price)) + geom_boxplot() + ggtitle("House Price vs. is Renovated")

Renovated houses are likely to have a higher price compared to non-renovated houses.
mapview(HousePattern, xcol = "long", ycol = "lat", zcol = "price", crs = 4269, grid = FALSE)
House prices increase from south to north along the latitude and show little variation along the longitude, from west to east.

SUMMARY EXPLANATION

Through Data Visualization and Pattern Discovery, we gathered the following information about the data:

1. Houses with views rated 3 and 4 have higher prices than houses with views rated 0, 1, or 2.

2. Looking at the 2nd quartile (the median) of the data, houses with 9 bedrooms have the highest price. However, looking at the overall distribution of the data, houses with 8 bedrooms have the highest price.

3. Houses priced below 1,000,000 dollars mostly have fewer than 3 bathrooms, and only a few of them have more than 3 bathrooms, in the range of 3.25 to 3.5.

4. Houses with 2.5 floors have the highest price when compared to other floor levels.

5. Houses with a waterfront command a higher price than houses without a waterfront.

6. Houses with conditions at levels 3, 4, and 5 are more expensive than houses with conditions at levels 1 or 2. Houses with a condition at level 5 are the most expensive.

7. The higher the grade of a house, the higher the price of the house.

8. Renovated houses are likely to have a higher price compared to non-renovated houses.

9. House prices increase from south to north along the latitude and show little variation along the longitude, from west to east.

6. Check Correlation

House_num = HouseData[c("bedrooms", "bathrooms", "sqft_living", "sqft_lot", "floors", "waterfront", "view", "condition", "grade", "sqft_above", "sqft_basement", "yr_built", "yr_renovated", "zipcode", "lat", "long", "sqft_living15", "sqft_lot15", "price")]

rcorr(as.matrix(House_num))
##               bedrooms bathrooms sqft_living sqft_lot floors waterfront  view
## bedrooms          1.00      0.52        0.58     0.03   0.18      -0.01  0.08
## bathrooms         0.52      1.00        0.75     0.09   0.50       0.06  0.19
## sqft_living       0.58      0.75        1.00     0.17   0.35       0.10  0.28
## sqft_lot          0.03      0.09        0.17     1.00  -0.01       0.02  0.07
## floors            0.18      0.50        0.35    -0.01   1.00       0.02  0.03
## waterfront       -0.01      0.06        0.10     0.02   0.02       1.00  0.40
## view              0.08      0.19        0.28     0.07   0.03       0.40  1.00
## condition         0.03     -0.12       -0.06    -0.01  -0.26       0.02  0.05
## grade             0.36      0.66        0.76     0.11   0.46       0.08  0.25
## sqft_above        0.48      0.69        0.88     0.18   0.52       0.07  0.17
## sqft_basement     0.30      0.28        0.44     0.02  -0.25       0.08  0.28
## yr_built          0.15      0.51        0.32     0.05   0.49      -0.03 -0.05
## yr_renovated      0.02      0.05        0.06     0.01   0.01       0.09  0.10
## zipcode          -0.15     -0.20       -0.20    -0.13  -0.06       0.03  0.08
## lat              -0.01      0.02        0.05    -0.09   0.05      -0.01  0.01
## long              0.13      0.22        0.24     0.23   0.13      -0.04 -0.08
## sqft_living15     0.39      0.57        0.76     0.14   0.28       0.09  0.28
## sqft_lot15        0.03      0.09        0.18     0.72  -0.01       0.03  0.07
## price             0.31      0.53        0.70     0.09   0.26       0.27  0.40
##               condition grade sqft_above sqft_basement yr_built yr_renovated
## bedrooms           0.03  0.36       0.48          0.30     0.15         0.02
## bathrooms         -0.12  0.66       0.69          0.28     0.51         0.05
## sqft_living       -0.06  0.76       0.88          0.44     0.32         0.06
## sqft_lot          -0.01  0.11       0.18          0.02     0.05         0.01
## floors            -0.26  0.46       0.52         -0.25     0.49         0.01
## waterfront         0.02  0.08       0.07          0.08    -0.03         0.09
## view               0.05  0.25       0.17          0.28    -0.05         0.10
## condition          1.00 -0.14      -0.16          0.17    -0.36        -0.06
## grade             -0.14  1.00       0.76          0.17     0.45         0.01
## sqft_above        -0.16  0.76       1.00         -0.05     0.42         0.02
## sqft_basement      0.17  0.17      -0.05          1.00    -0.13         0.07
## yr_built          -0.36  0.45       0.42         -0.13     1.00        -0.22
## yr_renovated      -0.06  0.01       0.02          0.07    -0.22         1.00
## zipcode            0.00 -0.18      -0.26          0.07    -0.35         0.06
## lat               -0.01  0.11       0.00          0.11    -0.15         0.03
## long              -0.11  0.20       0.34         -0.14     0.41        -0.07
## sqft_living15     -0.09  0.71       0.73          0.20     0.33         0.00
## sqft_lot15         0.00  0.12       0.19          0.02     0.07         0.01
## price              0.04  0.67       0.61          0.32     0.05         0.13
##               zipcode   lat  long sqft_living15 sqft_lot15 price
## bedrooms        -0.15 -0.01  0.13          0.39       0.03  0.31
## bathrooms       -0.20  0.02  0.22          0.57       0.09  0.53
## sqft_living     -0.20  0.05  0.24          0.76       0.18  0.70
## sqft_lot        -0.13 -0.09  0.23          0.14       0.72  0.09
## floors          -0.06  0.05  0.13          0.28      -0.01  0.26
## waterfront       0.03 -0.01 -0.04          0.09       0.03  0.27
## view             0.08  0.01 -0.08          0.28       0.07  0.40
## condition        0.00 -0.01 -0.11         -0.09       0.00  0.04
## grade           -0.18  0.11  0.20          0.71       0.12  0.67
## sqft_above      -0.26  0.00  0.34          0.73       0.19  0.61
## sqft_basement    0.07  0.11 -0.14          0.20       0.02  0.32
## yr_built        -0.35 -0.15  0.41          0.33       0.07  0.05
## yr_renovated     0.06  0.03 -0.07          0.00       0.01  0.13
## zipcode          1.00  0.27 -0.56         -0.28      -0.15 -0.05
## lat              0.27  1.00 -0.14          0.05      -0.09  0.31
## long            -0.56 -0.14  1.00          0.33       0.25  0.02
## sqft_living15   -0.28  0.05  0.33          1.00       0.18  0.59
## sqft_lot15      -0.15 -0.09  0.25          0.18       1.00  0.08
## price           -0.05  0.31  0.02          0.59       0.08  1.00
## 
## n= 21613 
## 
## 
## P
##               bedrooms bathrooms sqft_living sqft_lot floors waterfront view  
## bedrooms               0.0000    0.0000      0.0000   0.0000 0.3332     0.0000
## bathrooms     0.0000             0.0000      0.0000   0.0000 0.0000     0.0000
## sqft_living   0.0000   0.0000                0.0000   0.0000 0.0000     0.0000
## sqft_lot      0.0000   0.0000    0.0000               0.4445 0.0015     0.0000
## floors        0.0000   0.0000    0.0000      0.4445          0.0005     0.0000
## waterfront    0.3332   0.0000    0.0000      0.0015   0.0005            0.0000
## view          0.0000   0.0000    0.0000      0.0000   0.0000 0.0000           
## condition     0.0000   0.0000    0.0000      0.1879   0.0000 0.0144     0.0000
## grade         0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
## sqft_above    0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
## sqft_basement 0.0000   0.0000    0.0000      0.0246   0.0000 0.0000     0.0000
## yr_built      0.0000   0.0000    0.0000      0.0000   0.0000 0.0001     0.0000
## yr_renovated  0.0056   0.0000    0.0000      0.2612   0.3514 0.0000     0.0000
## zipcode       0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
## lat           0.1892   0.0003    0.0000      0.0000   0.0000 0.0359     0.3654
## long          0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
## sqft_living15 0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
## sqft_lot15    0.0000   0.0000    0.0000      0.0000   0.0976 0.0000     0.0000
## price         0.0000   0.0000    0.0000      0.0000   0.0000 0.0000     0.0000
##               condition grade  sqft_above sqft_basement yr_built yr_renovated
## bedrooms      0.0000    0.0000 0.0000     0.0000        0.0000   0.0056      
## bathrooms     0.0000    0.0000 0.0000     0.0000        0.0000   0.0000      
## sqft_living   0.0000    0.0000 0.0000     0.0000        0.0000   0.0000      
## sqft_lot      0.1879    0.0000 0.0000     0.0246        0.0000   0.2612      
## floors        0.0000    0.0000 0.0000     0.0000        0.0000   0.3514      
## waterfront    0.0144    0.0000 0.0000     0.0000        0.0001   0.0000      
## view          0.0000    0.0000 0.0000     0.0000        0.0000   0.0000      
## condition               0.0000 0.0000     0.0000        0.0000   0.0000      
## grade         0.0000           0.0000     0.0000        0.0000   0.0341      
## sqft_above    0.0000    0.0000            0.0000        0.0000   0.0006      
## sqft_basement 0.0000    0.0000 0.0000                   0.0000   0.0000      
## yr_built      0.0000    0.0000 0.0000     0.0000                 0.0000      
## yr_renovated  0.0000    0.0341 0.0006     0.0000        0.0000               
## zipcode       0.6565    0.0000 0.0000     0.0000        0.0000   0.0000      
## lat           0.0281    0.0000 0.9045     0.0000        0.0000   0.0000      
## long          0.0000    0.0000 0.0000     0.0000        0.0000   0.0000      
## sqft_living15 0.0000    0.0000 0.0000     0.0000        0.0000   0.6944      
## sqft_lot15    0.6166    0.0000 0.0000     0.0111        0.0000   0.2483      
## price         0.0000    0.0000 0.0000     0.0000        0.0000   0.0000      
##               zipcode lat    long   sqft_living15 sqft_lot15 price 
## bedrooms      0.0000  0.1892 0.0000 0.0000        0.0000     0.0000
## bathrooms     0.0000  0.0003 0.0000 0.0000        0.0000     0.0000
## sqft_living   0.0000  0.0000 0.0000 0.0000        0.0000     0.0000
## sqft_lot      0.0000  0.0000 0.0000 0.0000        0.0000     0.0000
## floors        0.0000  0.0000 0.0000 0.0000        0.0976     0.0000
## waterfront    0.0000  0.0359 0.0000 0.0000        0.0000     0.0000
## view          0.0000  0.3654 0.0000 0.0000        0.0000     0.0000
## condition     0.6565  0.0281 0.0000 0.0000        0.6166     0.0000
## grade         0.0000  0.0000 0.0000 0.0000        0.0000     0.0000
## sqft_above    0.0000  0.9045 0.0000 0.0000        0.0000     0.0000
## sqft_basement 0.0000  0.0000 0.0000 0.0000        0.0111     0.0000
## yr_built      0.0000  0.0000 0.0000 0.0000        0.0000     0.0000
## yr_renovated  0.0000  0.0000 0.0000 0.6944        0.2483     0.0000
## zipcode               0.0000 0.0000 0.0000        0.0000     0.0000
## lat           0.0000         0.0000 0.0000        0.0000     0.0000
## long          0.0000  0.0000        0.0000        0.0000     0.0015
## sqft_living15 0.0000  0.0000 0.0000               0.0000     0.0000
## sqft_lot15    0.0000  0.0000 0.0000 0.0000                   0.0000
## price         0.0000  0.0000 0.0015 0.0000        0.0000
Threshold = 0.4

- With the threshold set to 0.4, independent variables whose correlation with the dependent variable 'price' is below 0.4 are considered to have a weak relationship with it.

- Because of their weak correlation with the dependent variable, the variables to be dropped are the following: 'long', 'yr_built', 'zipcode', 'sqft_lot15', 'sqft_lot', 'yr_renovated', 'floors', 'condition', 'sqft_basement', and 'bedrooms'.

- Because of its strong correlation with another independent variable ('sqft_living', r = 0.88), 'sqft_above' is also dropped.

- The variables 'waterfront' and 'lat' were kept, even though their correlations with 'price' (0.27 and 0.31) are below the threshold, because our team considers that:
'waterfront': Houses with a waterfront generally have a higher price, so we opted to keep the 'waterfront' variable, since we assume the presence or absence of a waterfront influences the price of a house.
'lat': Whether a house sits at a strategic latitude also affects its price.
#Remove Variable step by step
House_num$long <- NULL
House_num$yr_built <- NULL
House_num$zipcode <- NULL
House_num$sqft_lot15 <- NULL
House_num$sqft_lot <- NULL
House_num$yr_renovated <- NULL
House_num$floors <- NULL
House_num$condition <- NULL
House_num$sqft_basement <- NULL
House_num$bedrooms <- NULL
House_num$sqft_above <- NULL
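
The selection above can also be sketched programmatically. This is only an illustration: the vector `price_cor` copies the 'price' column of the rcorr() output above, while the names `threshold`, `keep_anyway`, and `to_drop` are introduced here and are not part of the original analysis.

```r
# Correlation of each predictor with 'price', copied from the rcorr() output
price_cor <- c(
  bedrooms = 0.31, bathrooms = 0.53, sqft_living = 0.70, sqft_lot = 0.09,
  floors = 0.26, waterfront = 0.27, view = 0.40, condition = 0.04,
  grade = 0.67, sqft_above = 0.61, sqft_basement = 0.32, yr_built = 0.05,
  yr_renovated = 0.13, zipcode = -0.05, lat = 0.31, long = 0.02,
  sqft_living15 = 0.59, sqft_lot15 = 0.08
)

threshold   <- 0.4                      # same threshold as in the text
keep_anyway <- c("waterfront", "lat")   # kept on domain grounds (see text)

# Predictors whose absolute correlation with price is below the threshold
weak    <- names(price_cor)[abs(price_cor) < threshold]
to_drop <- setdiff(weak, keep_anyway)

print(to_drop)
```

With the real data, `House_num[, !(names(House_num) %in% to_drop)]` would then give the reduced data frame, before 'sqft_above' is removed separately for its correlation with 'sqft_living'.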

rcorr(as.matrix(House_num))
##               bathrooms sqft_living waterfront view grade   lat sqft_living15
## bathrooms          1.00        0.75       0.06 0.19  0.66  0.02          0.57
## sqft_living        0.75        1.00       0.10 0.28  0.76  0.05          0.76
## waterfront         0.06        0.10       1.00 0.40  0.08 -0.01          0.09
## view               0.19        0.28       0.40 1.00  0.25  0.01          0.28
## grade              0.66        0.76       0.08 0.25  1.00  0.11          0.71
## lat                0.02        0.05      -0.01 0.01  0.11  1.00          0.05
## sqft_living15      0.57        0.76       0.09 0.28  0.71  0.05          1.00
## price              0.53        0.70       0.27 0.40  0.67  0.31          0.59
##               price
## bathrooms      0.53
## sqft_living    0.70
## waterfront     0.27
## view           0.40
## grade          0.67
## lat            0.31
## sqft_living15  0.59
## price          1.00
## 
## n= 21613 
## 
## 
## P
##               bathrooms sqft_living waterfront view   grade  lat   
## bathrooms               0.0000      0.0000     0.0000 0.0000 0.0003
## sqft_living   0.0000                0.0000     0.0000 0.0000 0.0000
## waterfront    0.0000    0.0000                 0.0000 0.0000 0.0359
## view          0.0000    0.0000      0.0000            0.0000 0.3654
## grade         0.0000    0.0000      0.0000     0.0000        0.0000
## lat           0.0003    0.0000      0.0359     0.3654 0.0000       
## sqft_living15 0.0000    0.0000      0.0000     0.0000 0.0000 0.0000
## price         0.0000    0.0000      0.0000     0.0000 0.0000 0.0000
##               sqft_living15 price 
## bathrooms     0.0000        0.0000
## sqft_living   0.0000        0.0000
## waterfront    0.0000        0.0000
## view          0.0000        0.0000
## grade         0.0000        0.0000
## lat           0.0000        0.0000
## sqft_living15               0.0000
## price         0.0000
corrplot(cor(House_num),type='lower',tl.col='black',tl.srt=45,col=COL2('BrBG'))

After deleting the variables that are weakly correlated with the dependent variable 'price', the output above shows the remaining correlations between the dependent and independent variables.

7. Check Linearity

plot(House_num)

From this scatterplot matrix, it is shown that 'sqft_living' has a close-to-linear relationship with 'sqft_living15' and 'price'. Most of the numerical independent variables have a linear relationship with the dependent variable.

8. Check Normality

hist.data.frame(House_num)

None of these variables has a bell-shaped curve, but some of them come close, such as 'sqft_living' and 'sqft_living15', although they are skewed to the right (most values sit on the left, with a long tail to the right).
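
The direction of a skew can be checked numerically. The sketch below defines a small helper, `skewness()`, which is our own function (base R does not ship one), and applies it to the six 'sqft_living' values visible in the head() output at the start of the report. A positive value indicates a long right tail.

```r
# Moment-based sample skewness: third central moment divided by the
# 1.5th power of the second central moment. Helper defined here (assumption).
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# The six sqft_living values shown in the head() output (far too few
# for a real conclusion; used only to illustrate the call)
sqft_sample <- c(1180, 2570, 770, 1960, 1680, 5420)

skew_raw <- skewness(sqft_sample)        # skew on the original scale
skew_log <- skewness(log(sqft_sample))   # skew after a log transform

print(c(raw = skew_raw, log = skew_log))
```

On the full data, `sapply(House_num, skewness)` would give one value per variable.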

9. Modelling

model1 = lm(price~., data = House_num)
summary(model1)
## 
## Call:
## lm(formula = price ~ ., data = House_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1125915  -110389   -15481    77229  4780130 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -3.257e+07  5.079e+05 -64.116  < 2e-16 ***
## bathrooms     -2.008e+04  2.991e+03  -6.715 1.93e-11 ***
## sqft_living    1.815e+02  3.234e+00  56.122  < 2e-16 ***
## waterfront     6.093e+05  1.856e+04  32.830  < 2e-16 ***
## view           7.055e+04  2.188e+03  32.244  < 2e-16 ***
## grade          8.183e+04  2.113e+03  38.734  < 2e-16 ***
## lat            6.751e+05  1.071e+04  63.052  < 2e-16 ***
## sqft_living15  6.875e+00  3.488e+00   1.971   0.0488 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 216000 on 21605 degrees of freedom
## Multiple R-squared:  0.654,  Adjusted R-squared:  0.6539 
## F-statistic:  5834 on 7 and 21605 DF,  p-value: < 2.2e-16
plot(model1, which = 1)

The output of Model 1 shows that the F-statistic is 5834, residual standard error is 216000 and adjusted R-squared is 0.6539.
model2 <- lm(log(price) ~ bathrooms + sqft_living  + view + lat  + waterfront  + grade, data = House_num)
summary(model2)
## 
## Call:
## lm(formula = log(price) ~ bathrooms + sqft_living + view + lat + 
##     waterfront + grade, data = House_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02137 -0.17326 -0.01028  0.16406  1.16286 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.941e+01  6.454e-01 -92.044  < 2e-16 ***
## bathrooms    1.079e-02  3.791e-03   2.845  0.00444 ** 
## sqft_living  2.109e-04  3.707e-06  56.886  < 2e-16 ***
## view         8.832e-02  2.770e-03  31.886  < 2e-16 ***
## lat          1.489e+00  1.361e-02 109.467  < 2e-16 ***
## waterfront   3.774e-01  2.358e-02  16.004  < 2e-16 ***
## grade        1.481e-01  2.538e-03  58.352  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2746 on 21606 degrees of freedom
## Multiple R-squared:  0.7283, Adjusted R-squared:  0.7283 
## F-statistic:  9654 on 6 and 21606 DF,  p-value: < 2.2e-16
plot(model2, which = 1)

Another model, model2, was then fitted with log(price) as the response and without 'sqft_living15'. The results were satisfactory: an adjusted R-squared of 0.7283, a residual standard error of 0.2746 (on the log scale, so not directly comparable with that of model1), and an F-statistic of 9654. This model is better than the previous model. However, the 'bathrooms' variable only reaches two stars in the signif. codes (p = 0.00444), so we decided to fit another model below without 'bathrooms'.
model3 <- lm(log(price) ~ sqft_living  + view + lat  + waterfront  + grade, data = House_num)
summary(model3)
## 
## Call:
## lm(formula = log(price) ~ sqft_living + view + lat + waterfront + 
##     grade, data = House_num)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.02493 -0.17342 -0.00971  0.16429  1.15752 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.932e+01  6.448e-01  -92.00   <2e-16 ***
## sqft_living  2.163e-04  3.184e-06   67.93   <2e-16 ***
## view         8.791e-02  2.767e-03   31.78   <2e-16 ***
## lat          1.488e+00  1.359e-02  109.44   <2e-16 ***
## waterfront   3.772e-01  2.359e-02   15.99   <2e-16 ***
## grade        1.497e-01  2.478e-03   60.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2746 on 21607 degrees of freedom
## Multiple R-squared:  0.7282, Adjusted R-squared:  0.7282 
## F-statistic: 1.158e+04 on 5 and 21607 DF,  p-value: < 2.2e-16
plot(model3, which = 1)

In the newest model, the variables used to fit the dependent variable were 'sqft_living', 'view', 'lat', 'waterfront' and 'grade'. From the summary of model3, it can be seen that model3 has a residual standard error of 0.2746, which seems small enough, an adjusted R-squared of 0.7282, which seems high enough, an F-statistic of 11580, which seems high enough, and all of the predictors have highly significant p-values (<2e-16). Compared with model2, the adjusted R-squared is practically unchanged (0.7282 vs. 0.7283) with one fewer predictor, and the F-statistic is higher. Hence, it was decided to use model3 as the final model.

The best model, model3, is a linear regression of house price on the 'sqft_living', 'view', 'lat', 'waterfront' and 'grade' variables. The linear regression model was trained with log-transformed house price.
The equation obtained from the final model is 
  log(price) = -5.932e+01 + sqft_living*(2.163e-04) + view*(8.791e-02) + lat*(1.488e+00) + waterfront*(3.772e-01) + grade*(1.497e-01)
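
As a worked example of this equation, the sketch below plugs in the first row of the head() output at the start of the report (sqft_living = 1180, view = 0, lat = 47.5112, waterfront = 0, grade = 7, actual price 221900). The coefficients are the rounded values printed in the summary, so the result is only approximate; the object names `coefs` and `house` are introduced here for illustration.

```r
# Rounded coefficients copied from summary(model3)
coefs <- c(intercept = -5.932e+01, sqft_living = 2.163e-04, view = 8.791e-02,
           lat = 1.488e+00, waterfront = 3.772e-01, grade = 1.497e-01)

# Predictor values of the first row shown by head(HouseData)
house <- c(sqft_living = 1180, view = 0, lat = 47.5112, waterfront = 0, grade = 7)

# Linear predictor on the log scale, then back-transform to dollars
log_price       <- unname(coefs["intercept"] + sum(coefs[names(house)] * house))
predicted_price <- exp(log_price)

print(log_price)
print(predicted_price)
```

Because the model predicts log(price), the exp() step is needed before the result can be compared with an actual price.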

10. Create Training and Testing Set and Check Accuracy

set.seed(1)
training_index = createDataPartition(House_num$price, p = 0.8, list = FALSE)
# p = 0.8 --> 80% of the records in the dataset divided to training set and 20% of the records to testing set.
testing_set = House_num[-training_index,]
trainingset = House_num[training_index,]

# The testing set will be used to check the accuracy of the final model.
# Note: model3 was fitted on log(price), so predict() returns log-scale values.
testing_set$Predicted <- predict(model3, testing_set)
Price <- testing_set$price
Predicted <- testing_set$Predicted
Residual <- Price - Predicted

actual_prediction <- data.frame(Price, Predicted, Residual)
# checking accuracy
cor(actual_prediction)
##               Price Predicted  Residual
## Price     1.0000000 0.7915863 1.0000000
## Predicted 0.7915863 1.0000000 0.7915859
## Residual  1.0000000 0.7915859 1.0000000
An evaluation of our final model, model3, on the testing set shows a correlation of about 0.792 between the actual prices and the predicted values, which we take as an accuracy of roughly 79.2%.
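
A minimal sketch of an out-of-sample check on simulated data, under two assumptions that differ from the code above: the model is refitted on the training rows only, and the log-scale predictions are back-transformed with exp() before being compared with prices. The data frame `toy`, its coefficients, and the use of base R's sample() in place of createDataPartition() are all hypothetical choices made for illustration.

```r
set.seed(1)

# Simulated stand-in for the house data (assumption, not the real dataset)
n <- 500
toy <- data.frame(
  sqft_living = runif(n, 500, 4000),
  grade       = sample(5:11, n, replace = TRUE)
)
toy$price <- exp(11 + 3e-04 * toy$sqft_living + 0.15 * toy$grade + rnorm(n, sd = 0.2))

# 80/20 split with base R
train_idx <- sample(seq_len(n), size = round(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, so the test rows are unseen by the model
fit <- lm(log(price) ~ sqft_living + grade, data = train)

# Back-transform log-scale predictions to the price scale
pred_price <- exp(predict(fit, newdata = test))

# Error and agreement measures on the price scale
rmse          <- sqrt(mean((test$price - pred_price)^2))
r_actual_pred <- cor(test$price, pred_price)

print(c(rmse = rmse, correlation = r_actual_pred))
```

Reporting an error measure such as the RMSE next to the correlation shows how far off the predictions are in dollars, which a correlation alone does not.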

11. Check Residual Normality

# Check whether the residuals are bell-shaped
hist(rstudent(model3), col = "thistle")

From the results above, it is shown that the accuracy of model3 is quite high, at 79.2%. From the histogram above, we can see that the residuals are fairly normally distributed, with only a slight skew. Our team concluded that the model was good.
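
As a numeric complement to the histogram, the share of studentized residuals within ±2 can be checked; for roughly normal residuals it should be close to 95%. The sketch below uses the built-in `mtcars` data and a stand-in model `fit_demo`, both assumptions for illustration, since model3 is not reproduced here.

```r
# Stand-in model on a built-in dataset (assumption); with the report's
# objects the same calls would be applied to model3
fit_demo <- lm(mpg ~ wt, data = mtcars)

# Studentized residuals, as used for the histogram in the text
r <- rstudent(fit_demo)

# Share of residuals within two units of zero
share_within_2 <- mean(abs(r) < 2)

print(share_within_2)
```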

12. Plot Final Model

predict_price <- predict(model3, testing_set)
linear_model <- lm(testing_set$price ~ exp(predict_price))
plot(exp(predict_price), testing_set$price, xlab="Predicted Price", ylab="Actual Price")
abline(linear_model)

The x-axis and y-axis display the predicted prices and the actual prices from the dataset, respectively. The estimated regression line is displayed as a diagonal line in the middle of the plot. Since most of the data points lie fairly close to the estimated regression line, the regression model does a reasonably good job of fitting the data.
par(mfrow=c(2,2))
plot(model3)

From these 4 plots:
- 'Residuals vs Fitted': the line shows barely any curve, so there is no visible pattern in the residuals.
- 'Normal Q-Q': the majority of the data points lie on the line, suggesting that the residuals of our final model are approximately normally distributed.
- 'Scale-Location': as in the 'Residuals vs Fitted' plot, there is no visible pattern, which is a good sign.
- 'Residuals vs Leverage': no points fall outside the upper-right and lower-left dashed lines, which means there are no influential points in the regression model.
So, our team decided to deploy the model, because the regression model is a good fit for the data.
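
The 'Residuals vs Leverage' reading can also be sketched numerically with Cook's distance, whose 0.5 and 1 contours are the dashed lines in that plot. The example uses the built-in `mtcars` data and a stand-in model `fit_demo` (assumptions for illustration); with the report's objects, `cooks.distance(model3)` would be used instead.

```r
# Stand-in model on a built-in dataset (assumption)
fit_demo <- lm(mpg ~ wt, data = mtcars)

# Cook's distance for every observation
cd <- cooks.distance(fit_demo)

# Observations beyond the outer dashed contour (Cook's distance > 1)
influential <- names(cd)[cd > 1]

print(max(cd))
print(influential)
```

An empty `influential` vector corresponds to "no points outside the dashed lines" in the plot.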

CONCLUSION

- Every additional square foot of living space increases the predicted price of a house.
- The price of a house is also determined by the view that the house possesses. In particular, houses with views rated 3 and 4 have higher prices than houses with views rated 0, 1, or 2.
- The higher the grade of a house, the higher the price of the house.
- Houses with a waterfront command a higher price than houses without a waterfront.
- House prices increase from south to north along the latitude.
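
Because model3 was fitted on log(price), each coefficient b can be read as a multiplicative effect: a one-unit increase in a predictor changes the predicted price by (exp(b) - 1) × 100 percent, holding the other predictors fixed. A small sketch using the rounded coefficients from the model3 summary; the names `b`, `pct_change`, and `pct_per_100_sqft` are introduced here for illustration.

```r
# Rounded slope coefficients copied from summary(model3)
b <- c(sqft_living = 2.163e-04, view = 8.791e-02, lat = 1.488e+00,
       waterfront = 3.772e-01, grade = 1.497e-01)

# Percent change in predicted price per one-unit increase of each predictor
pct_change <- (exp(b) - 1) * 100

# A single square foot is a tiny step, so also report the effect of 100 sqft
pct_per_100_sqft <- (exp(100 * b[["sqft_living"]]) - 1) * 100

print(round(pct_change, 2))
print(round(pct_per_100_sqft, 2))
```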

However, it is important to note that the data for this report was collected several years ago. It would be interesting to examine how these factors affect house pricing in King County, USA today.